Big Data Categorization for Arabic Text Using Latent Semantic Indexing and Clustering
نویسنده
چکیده
Documents categorization is an important field in the area of natural language processing. In this paper, we propose using Latent Semantic Indexing (LSI), singular value decomposing (SVD) method, and clustering techniques to group similar unlabeled document into pre-specified number of topics. The generated groups are then categorized using a suitable label. For clustering, we used Expectation–Maximization (EM), Self-Organizing Map (SOM), and K-Means algorithms. In our experiments, we use a corpus that contains 1000 documents of ten topics (100 document for each topic. The results show that using LSI and clustering techniques for Arabic text categorization achieves good performance. The results show that EM clustering method outperforms other investigated clustering methods with an average categorization accuracy of 89%. Keywords—Arabic Text, Categorization, Classification, Clustering, Unsupervised Learning.
منابع مشابه
Support Vector Machines for Text Categorization Based on Latent Semantic Indexing
Text Categorization(TC) is an important component in many information organization and information management tasks. Two key issues in TC are feature coding and classifier design. In this paper Text Categorization via Support Vector Machines(SVMs) approach based on Latent Semantic Indexing(LSI) is described. Latent Semantic Indexing[1][2] is a method for selecting informative subspaces of featu...
متن کاملA Joint Semantic Vector Representation Model for Text Clustering and Classification
Text clustering and classification are two main tasks of text mining. Feature selection plays the key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researches to use...
متن کاملClassification and clustering methods for documents by probabilistic latent semantic indexing model
Based on information retrieval model especially probabilistic latent semantic indexing (PLSI) model, we discuss methods for classification and clustering of a set of documents. A method for classification is presented and is demonstrated its good performance by applying to a set of benchmark documents with free format (text only). Then the classification method is modified to a clustering metho...
متن کاملChinese Text Categorization via Bottom-Up Weighted Word Clustering
Most of the researches on text categorization are focus on using bag of words. Some researches provided other methods for classification such as term phrase, Latent Semantic Indexing, and term clustering. Term clustering is an effective way for classification, and had been proved as a good method for decreasing the dimensions in term vectors. The authors used hierarchical term clustering and ag...
متن کاملText Categorization and Information Retrieval Using WordNet Senses
In this paper we study the influence of semantics in the Text Categorization (TC) and Information Retrieval (IR) tasks. The K Nearest Neighbours (K-NN) method was used to perform the text categorization. The experimental results were obtained taking into account for a relevant term of a document its corresponding WordNet synset. For the IR task, three techniques were investigated: the direct us...
متن کامل